library(magrittr)
library(tidyverse)
library(limma)
library(edgeR)
library(AnnotationHub)
library(fgsea)
library(msigdbr)
library(goseq)
library(scales)
library(DT)
library(ggrepel)
library(ggeasy)
library(cqn)
library(ggfortify)
library(cowplot)
library(pathview)
library(pander)
library(ggpubr)
library(viridis)
library(pheatmap)
library(fgsea)
library(biomaRt)
library(Gviz)
library(BSgenome.Drerio.UCSC.danRer11)
theme_set(theme_bw())
This analysis will be doing a differential gene expression and differential pathway analysis from fmr1 knockout larvae at 2dpf. Previously, Stephen Pederson and Melanie Hand did a DE analysis using limma. Here, I will do one with the generalised linear model (GLM) capablities of edgeR as this, in my experience, detects more DE genes. I then will do some pathway analysis on DE genes (if we get enough), and also the entire dataset using ranked lists.
This was taken from FMR1_analysis.Rmd
Prior to count-level analysis, the initial dataset was pre-processed using the following steps:
After trimming alignment was performed using STAR v2.5.3a to the Danio rerio genome included in Ensembl Release 94 (GRCz11). Aligned reads were counted for each gene if the following criteria were satisfied:
Counts were then loaded and set up as a DGEList. During loading genes were only retained if receiving more than one counted alignment in at least 4 samples.
ah <- AnnotationHub() %>%
subset(species == "Danio rerio") %>%
subset(dataprovider == "Ensembl") %>%
subset(rdataclass == "EnsDb")
ensDb <- AnnotationHub()[["AH64906"]]
genes <- genes(ensDb)
x <- here::here("2_alignedData/featureCounts/genes.out") %>%
read_tsv() %>%
as.data.frame() %>%
column_to_rownames("Geneid") %>%
set_colnames(basename(colnames(.))) %>%
set_colnames(str_remove(colnames(.), "_Aligned.+")) %>%
magrittr::extract(rowSums(cpm(.) > 1) >= 4,) %>%
DGEList(
samples = tibble(
sample = colnames(.),
group = case_when(
sample %in% paste0("S", 1:8) ~ "FMR1",
!sample %in% paste0("S", 1:8) ~ "Wildtype"
)) %>%
mutate(group = factor(group, levels = c("Wildtype", "FMR1"))) %>%
as.data.frame(),
genes = genes[rownames(.)] %>%
as.data.frame() %>%
mutate(location = paste0(seqnames, ":", start, "-", end, ":", strand)) %>%
dplyr::select(
ensembl_gene_id = gene_id,
location,
description,
external_gene_name = gene_name
)
) %>%
calcNormFactors(method = "TMM") %>%
estimateDisp()
## Design matrix not provided. Switch to the classic mode.
The overall similarity between samples can be explored by principal component analysis (PCA). PCA is a dimension reducing technique which uses linear combinations of the original data (i.e. the logCPMs of the genes in the RNA-seq dataset) to define a new set of variables (principal components).
Samples which have similar gene expression profiles will cluster together in a plot of principal components (PC) PC1 vs PC2. In the graph below, we see some seperation of samples by genotype across PC1. This means that fmr1 genotype is the largest source of variation and explains ~35% of the variance. Sample S5 doesnt group with the rest of the fmr1 samples, so it is likely an outlier. I will do a DE and pathway analysis including it.
pca_logCPM <- x$counts %>%
cpm(log = TRUE) %>%
t() %>%
prcomp()
pca_logCPM %>%
autoplot(data = x$samples, colour = "group", size = 4) +
geom_text_repel(
aes(colour = group, label = sample),
show.legend = FALSE
) +
labs(colour = "Genotype")
PCA analysis using logCPM values after filtering of undectable genes. Sample S5 does appear most similar to the wiltype samples, however the genotype has been checked via expression of fmr1
I will use the generalised linear model functionality of the package edgeR to perform the DE analysis. edgeR uses a negative binomial variance function and estimates dispersions using the Cox-Reid profile-adjusted likelihood (CR) method. Tests for differential expression were perforned using likelihood ratio tests.
The design matrix specifies which samples contain which characteristics. Here, WT was set as the intercept (or baseline) and the effect of FMR1 KO was set as the slope coefficient.
design <- model.matrix(~group, data = x$samples)
colnames(design) <- str_remove(colnames(design), "group")
Dispersions were already esitmated when generating the dge object x. This code below fits the negative binomial model.
glmFit <- glmFit(x, design)
This code below tests for differential expression. Genes were considered DE if the FDR adjusted p-value was < 0.05.
topTable_glm <- glmFit %>%
glmLRT(coef = "FMR1") %>%
topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>%
.$table %>%
as_tibble() %>%
mutate(DE = FDR < 0.05)
Mean difference plots (MD)-plots show the average expression vs logFC. The DE genes had low/mild average expression (logCPM between ~0.3 and 6.7). Importantly, there appears to be a small bias in this dataset. This may impact GSEA analysis as genesets with low-medium expressed genes will appear as enriched for being upregulated.
md1 <- topTable_glm %>%
ggplot(aes(x = logCPM, y = logFC)) +
geom_point(aes(colour = DE), alpha = 0.5) +
geom_smooth(se = FALSE) +
scale_color_manual(values = c("grey50", "red")) +
geom_label_repel(
aes(label = external_gene_name, colour = DE),
data = . %>% dplyr::filter(DE == TRUE),
show.legend = FALSE
)
md1
In response to the observed bias CQN was used to rectify this issue as it may be derived from either a length or GC artefact. GC and length information is easily avaaible from EnsDb objects from release 98 onwards, and this release was used. No expresed genes were noted as absent from the annotations for Release 98, despite the differences between releases. GC content and Length were taken as weighted averages and simple averages respectively.
ens98 <- ah[["AH74989"]]
ex <- exonsBy(ens98, "tx")
tr <- transcripts(ens98)
tr$len <- ex %>%
width %>%
lapply(sum) %>%
.[names(tr)] %>%
unlist()
gnBias <- mcols(tr) %>%
as.data.frame() %>%
group_by(gene_id) %>%
summarise(
n = n(),
meanGC = sum(len*gc_content) / sum(len),
len = mean(len)
)
mcols(genes) %<>%
as.data.frame() %>%
left_join(gnBias) %>%
DataFrame()
gcCqn <- cqn(
counts = x$counts,
x = enframe(rownames(x), name = c()) %>%
dplyr::rename(gene_id = value) %>%
left_join(gnBias) %>%
.[["meanGC"]],
lengths = enframe(rownames(x), name = c()) %>%
dplyr::rename(gene_id = value) %>%
left_join(gnBias) %>%
.[["len"]],
sizeFactors = x$samples$lib.size
)
x$offset <- gcCqn$glm.offset
par(mfrow = c(1, 2))
cqnplot(gcCqn, n = 1, xlab = "GC Content")
cqnplot(gcCqn, n = 2, xlab = "Length")
Model fits for GC content and gene length under the CQN model. Variability is clearly visible.
par(mfrow = c(1, 1))
logCPM <- gcCqn$y + gcCqn$offset
logCPM %>%
t() %>%
prcomp() %>%
autoplot(data = x$samples, colour = "group", size = 4) +
geom_text_repel(
aes(label = sample, colour = group),
show.legend = FALSE
) +
labs(colour = "Genotype")
PCA after CQN showing similar patterns of separation as prior to this normalisation.
This code below tests for differential expression using Likelihood ratio tests. Genes were considered DE if the FDR adjusted p-value was < 0.05.
glmCQN <- glmFit(x, design)
topTableCQN <- glmCQN %>%
glmLRT(coef = "FMR1") %>%
topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>%
.$table %>%
as_tibble() %>%
mutate(DE = FDR < 0.05)
md2 <- topTableCQN %>%
ggplot(aes(x = logCPM, y = logFC)) +
geom_point(aes(colour = DE), alpha = 0.5) +
geom_smooth(se = FALSE) +
scale_color_manual(values = c("grey50", "red")) +
geom_label_repel(
aes(label = external_gene_name, colour = DE),
data = . %>% dplyr::filter(DE == TRUE),
show.legend = FALSE
)
plot_grid(
md1 +
coord_cartesian(ylim = c(-3, 3)) +
theme(legend.position = "none"),
md2 +
coord_cartesian(ylim = c(-3, 3)) +
theme(
legend.position = c(1, 0),
legend.justification = c(1, 0),
legend.background = element_rect(colour = "grey30", size = 1/4)
),
nrow = 1,
labels = c("A", "B")
)
Comparison of MD plots before and after CQN. The bias evident in A) the pre-CQN plots is no longer present in B) the post-CQN plot, suggesting GC and length bias were contributing factors.
topTableCQN %>%
ggplot(
aes(x = logFC, y = -log10(PValue), colour = DE)
) +
geom_point(alpha = 0.5) +
scale_color_manual(values = c("grey50", "red")) +
geom_label_repel(
aes(label = external_gene_name),
data = . %>% dplyr::filter(DE == TRUE),
show.legend = FALSE
) +
theme_bw()
A total of 22 genes were considered as significant as given in the table below.
topTableCQN %>%
dplyr::filter(DE) %>%
dplyr::select(
ID = ensembl_gene_id,
Gene = external_gene_name,
starts_with("log"),
PValue, FDR,
Location = location,
Description = description
) %>%
datatable(
caption = "The most highly ranked genes for differential expression.",
rownames = FALSE,
options = list(
pageLength = 25
)
) %>%
formatRound(
columns = c("logFC", "logCPM"),
digits = 2
) %>%
formatSignif(
columns = c("PValue", "FDR"),
digits = 3
)
topTable_glm %>%
write.csv(here::here("results/DE_results_fmr1KO_KB.csv"))
LogCPMs of the significantly DE genes are shown in the plot below.
logCPM %>%
.[dplyr::filter(topTableCQN, DE == TRUE)$ensembl_gene_id,] %>%
as.data.frame() %>%
rownames_to_column("ensembl_gene_id") %>%
gather(key = "sample", value = "logCPM", colnames(x$counts)) %>%
left_join(x$samples, by = "sample") %>%
left_join(x$genes) %>%
ggplot(aes(x = group, y = logCPM, fill = group)) +
geom_boxplot(alpha = 0.5) +
geom_jitter(width = 0.1, colour = "grey30") +
facet_wrap(~external_gene_name, scales = "free_y", ncol = 4) +
theme_bw() +
theme(legend.position = "none")
Boxplots of all DE genes with points overlaid
pwf <- topTableCQN %>%
left_join(gnBias, by = c("ensembl_gene_id" = "gene_id")) %>%
with(
nullp(
DEgenes = structure(DE, names = ensembl_gene_id),
bias.data = meanGC
)
)
ens2Entrez <- genes[rownames(x)] %>%
mcols() %>%
as.data.frame() %>%
as_tibble() %>%
dplyr::select(
gene_id,
entrez_gene = entrezid
) %>%
unchop(entrez_gene) %>%
dplyr::filter(!is.na(entrez_gene))
geneByPos <- x$genes %>%
as_tibble() %>%
separate(location, into = c("chr", "range", "strand"), sep = ":") %>%
split(f = .$ensembl_gene_id) %>%
lapply(extract2, "chr")
posByChr <- x$genes %>%
as_tibble() %>%
separate(location, into = c("chr", "range", "strand"), sep = ":") %>%
split(f = .$chr) %>%
lapply(extract2, "ensembl_gene_id")
hm <- msigdbr("Danio rerio", category = "H") %>%
inner_join(ens2Entrez) %>%
distinct(gs_name, gene_id, .keep_all = TRUE)
hmByGene <- hm %>%
split(f = .$gene_id) %>%
lapply(extract2, "gs_name")
hmByID <- hm %>%
split(f = .$gs_name) %>%
lapply(extract2, "gene_id")
kg <- msigdbr("Danio rerio", category = "C2", subcategory = "CP:KEGG") %>%
inner_join(ens2Entrez) %>%
distinct(gs_name, gene_id, .keep_all = TRUE)
kgByGene <- kg %>%
split(f = .$gene_id) %>%
lapply(extract2, "gs_name")
kgByID <- kg %>%
split(f = .$gs_name) %>%
lapply(extract2, "gene_id")
goseq(pwf, gene2cat = geneByPos) %>%
dplyr::filter(numDEInCat > 0) %>%
as_tibble() %>%
mutate(
adjP = p.adjust(over_represented_pvalue, "bonferroni"),
Expected = sum(topTableCQN$DE) * numInCat / nrow(x)
) %>%
dplyr::select(
chromosome = category, Expected, Observed = numDEInCat, N = numInCat, PValue = over_represented_pvalue, adjP
) %>%
pander(
caption = "Results for enrichment testing within the DE set of genes, based on chromosomal location."
)
Using manually entered categories. Calculating the p-values…
| chromosome | Expected | Observed | N | PValue | adjP |
|---|---|---|---|---|---|
| 14 | 0.7991 | 12 | 664 | 0 | 0 |
| 7 | 1.134 | 3 | 942 | 0.1189 | 0.9513 |
| 6 | 0.9869 | 2 | 820 | 0.2752 | 1 |
| 10 | 0.7859 | 1 | 653 | 0.5314 | 1 |
| 22 | 0.757 | 1 | 629 | 0.5699 | 1 |
| 17 | 0.8461 | 1 | 703 | 0.57 | 1 |
| 19 | 0.8677 | 1 | 721 | 0.5838 | 1 |
| 4 | 0.8364 | 1 | 695 | 0.5869 | 1 |
goseq(pwf, gene2cat = hmByGene) %>%
dplyr::filter(numDEInCat > 0) %>%
as_tibble() %>%
mutate(
adjP = p.adjust(over_represented_pvalue, "bonferroni"),
Expected = sum(topTableCQN$DE) * numInCat / nrow(x)
) %>%
dplyr::select(
Category = category, Expected, Observed = numDEInCat, `GS Size` = numInCat, PValue = over_represented_pvalue, adjP
) %>%
pander(
caption = "Results for enrichment testing for Hallmark Gene Sets within the DE set of genes",
split.tables = Inf,
justify = "lrrrrr"
)
Using manually entered categories. For 14633 genes, we could not find any categories. These genes will be excluded. To force their use, please run with use_genes_without_cat=TRUE (see documentation). This was the default behavior for version 1.15.1 and earlier. Calculating the p-values…
| Category | Expected | Observed | GS Size | PValue | adjP |
|---|---|---|---|---|---|
| HALLMARK_ESTROGEN_RESPONSE_EARLY | 0.207 | 2 | 172 | 0.01198 | 0.08385 |
| HALLMARK_PROTEIN_SECRETION | 0.1119 | 1 | 93 | 0.1011 | 0.7079 |
| HALLMARK_INFLAMMATORY_RESPONSE | 0.1396 | 1 | 116 | 0.1269 | 0.8886 |
| HALLMARK_COMPLEMENT | 0.1889 | 1 | 157 | 0.1678 | 1 |
| HALLMARK_ESTROGEN_RESPONSE_LATE | 0.1974 | 1 | 164 | 0.1775 | 1 |
| HALLMARK_P53_PATHWAY | 0.219 | 1 | 182 | 0.1871 | 1 |
| HALLMARK_MTORC1_SIGNALING | 0.2383 | 1 | 198 | 0.206 | 1 |
goseq(pwf, gene2cat = kgByGene) %>%
dplyr::filter(numDEInCat > 0) %>%
as_tibble() %>%
mutate(
adjP = p.adjust(over_represented_pvalue, "bonferroni"),
Expected = sum(topTableCQN$DE) * numInCat / nrow(x)
) %>%
dplyr::select(
Category = category, Expected, Observed = numDEInCat, `GS Size` = numInCat, PValue = over_represented_pvalue, adjP
) %>%
pander(
caption = "Results for enrichment testing for KEGG Gene Sets within the DE set of genes",
split.tables = Inf,
justify = "lrrrrr"
)
Using manually entered categories. For 14556 genes, we could not find any categories. These genes will be excluded. To force their use, please run with use_genes_without_cat=TRUE (see documentation). This was the default behavior for version 1.15.1 and earlier. Calculating the p-values…
| Category | Expected | Observed | GS Size | PValue | adjP |
|---|---|---|---|---|---|
| KEGG_LYSOSOME | 0.1372 | 2 | 114 | 0.003041 | 0.02433 |
| KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_GLOBO_SERIES | 0.008425 | 1 | 7 | 0.005353 | 0.04282 |
| KEGG_GALACTOSE_METABOLISM | 0.02407 | 1 | 20 | 0.01409 | 0.1128 |
| KEGG_SPHINGOLIPID_METABOLISM | 0.04092 | 1 | 34 | 0.02638 | 0.2111 |
| KEGG_DRUG_METABOLISM_CYTOCHROME_P450 | 0.03611 | 1 | 30 | 0.02822 | 0.2258 |
| KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 | 0.03731 | 1 | 31 | 0.0284 | 0.2272 |
| KEGG_GLYCEROLIPID_METABOLISM | 0.04453 | 1 | 37 | 0.02907 | 0.2325 |
| KEGG_GLUTATHIONE_METABOLISM | 0.04934 | 1 | 41 | 0.03899 | 0.3119 |
Since only 14 genes were found as DE, I will perform a GSEA on ranked lists (rather than overrepresentation in the DE genes) to see if any groups of genes within pre-defined gene sets are showing shifts towards up or downregulation as a group. I will look at the KEGG and HALLMARK gene sets available from MSigDB (https://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp). These gene sets have human gene entrez identifiers. So I will convert these human entrez ids to zebrafish ensembl ids using a mapping file downloaded from bioMart (https://m.ensembl.org/biomart). Some genes in the KEGG pathway gene sets did not contain a zebrafish orthologue. Therefore, the gene sets sometimes contained < 10 genes. I filtered out any KEGG pathway gene sets with less than 10 genes as they would not be particularly informative.
fry from the limma package approximates the ROAST method. ROAST uses residual space rotation as a sort of continuous version of sample permutation. Like permutation tests, it protects against false positives caused by correlations between genes in the set (Wu et al, 2010). Using this method, I did not find any significantly altered gene sets in a directional context (FDR column), or mixed (so no directionality is taken into account (FDR.mixed column)).
fryRes <- logCPM %>%
fry(
index = c(hmByID, kgByID, posByChr),
design = design,
contrast = "FMR1",
sort = "mixed"
) %>%
rownames_to_column("geneset") %>%
as_tibble()
fryRes %>%
arrange(PValue.Mixed) %>%
mutate(FDR.Mixed = p.adjust(PValue.Mixed, "fdr")) %>%
dplyr::select(geneset, NGenes, PValue = PValue.Mixed, FDR = FDR.Mixed) %>%
dplyr::slice(1:10) %>%
pander(
caption = paste(
"The", nrow(.), "most highly ranked gene sets across Chromosomal, Hallmark and KEGG testing using Fry.",
"Tests were non-directional (Mixed)."
),
split.tables = Inf,
justify = "lrrr"
)
| geneset | NGenes | PValue | FDR |
|---|---|---|---|
| 14 | 664 | 0.0006031 | 0.158 |
| KEGG_HISTIDINE_METABOLISM | 23 | 0.004863 | 0.3537 |
| HALLMARK_MYC_TARGETS_V2 | 57 | 0.005539 | 0.3537 |
| KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION | 44 | 0.006401 | 0.3537 |
| KEGG_RNA_POLYMERASE | 26 | 0.007102 | 0.3537 |
| KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_GLOBO_SERIES | 7 | 0.008145 | 0.3537 |
| KEGG_GALACTOSE_METABOLISM | 20 | 0.009449 | 0.3537 |
| KEGG_GLYCEROLIPID_METABOLISM | 37 | 0.01628 | 0.3867 |
| KEGG_RNA_DEGRADATION | 54 | 0.01818 | 0.3867 |
| KEGG_GLYCOSYLPHOSPHATIDYLINOSITOL_GPI_ANCHOR_BIOSYNTHESIS | 22 | 0.01909 | 0.3867 |
Fry showed that genes on chromosome 14 tend to be differentially expressed. I want to determine whether there may be some biological reason why this is oberved. Therefore, I will test whether the top most altered genes from chromosome 14 are enriched in any KEGG or HALLMARK gene sets,
I will use fgsea, the fast implementation of the GSEA algorithm, using the chromosomal position gene sets on an undirectional ranked list (i.e. most DE -> to least DE). The leading edge fromt fgsea will give the genes which are the most dysregulated from chromosome 14.
I then will perform an over-representation analysis (using the KEGG and HALLMARK gene sets) on the leading edge genes to determine whether they are have a particular biological function.
# make the ranks
ranks <- topTableCQN %>%
mutate(rank = log10(1/PValue)) %>%
arrange(rank) %>%
dplyr::select(c("ensembl_gene_id", "rank")) %>% #only want the Pvalue with sign
with(structure(rank, names = ensembl_gene_id)) %>%
rev()
png("rankedlist.png", width = 10, height = 10, units = "cm", res = 600)
plot(ranks,
xlab = "Genes", ylab = "-log10(PValue)")
dev.off()
## quartz_off_screen
## 2
# perform fgsea on all the chr gene sets.
gsea_chr <- fgseaSimple(pathways = posByChr,
stats = ranks,
nperm = 1e5
) %>%
as_tibble()
gsea_chr %>%
arrange(pval) %>%
mutate(padj = p.adjust(pval, "bonferroni")) %>%
dplyr::filter(padj < 0.05) %>%
dplyr::select(-leadingEdge) %>%
pander(
caption = "Results for enrichment testing using fgsea based on chr position."
)
| pathway | pval | padj | ES | NES | nMoreExtreme | size |
|---|---|---|---|---|---|---|
| 14 | 1e-05 | 0.00026 | 0.5325 | 1.38 | 0 | 664 |
| 22 | 3e-05 | 0.00078 | 0.4641 | 1.202 | 2 | 629 |
# plot the enrichment
png("chr14_enrichment.png", width = 10, height = 10, units = "cm", res = 600)
plotEnrichment(pathway = posByChr$`14`,
stats = ranks)
dev.off()
## quartz_off_screen
## 2
png("chr22_enrichment.png", width = 10, height = 10, units = "cm", res = 600)
plotEnrichment(pathway = posByChr$`22`,
stats = ranks)
dev.off()
## quartz_off_screen
## 2
Next, I will take the leading edge for chromosome 14 (i.e. the genes in the ranked list from chromosome 14 which are before the peak on the enrichment plot above) and determine whether they are over-represented in any KEGG gene sets using the kegga function from the limma package.
First, we perfomed over-representation analysis using kegga() of the leading edge subset against all detectable genes in the RNA-seq experiment. The KEGG gene sets for RNA polymerase and NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION were found to be signifcatly enriched (by FDR).
kegga(de = gsea_chr %>% dplyr::filter(pathway == "14") %>%
.$leadingEdge %>% .[[1]],
geneid = rownames(x$counts),
universe = rownames(x$counts),
gene.pathway = kgByID %>%
unlist %>%
as.data.frame() %>%
rownames_to_column("pathway") %>%
set_colnames(c("pathway", "gene_id")) %>%
mutate(pathway = pathway %>% str_remove_all("[0-9]+$")) %>%
dplyr::select(gene_id, pathway)
) %>%
topKEGG() %>%
as.data.frame() %>%
rownames_to_column("pathway") %>%
mutate(FDR = p.adjust(P.DE, "fdr"),
pbonf = p.adjust(P.DE, "bonferroni")) %>%
dplyr::select(pathway, raw.p = P.DE, FDR, pbonf, `No. in leadingedge` = DE, `No. in geneset` = N) %>%
head(10) %>%
pander(caption = "Results of over-representation of genes in the leading edge of chromosome 14 against all genes detetcted in RNA-seq experiment")
| pathway | raw.p | FDR |
|---|---|---|
| KEGG_RNA_POLYMERASE | 0.003029 | 0.04454 |
| KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION | 0.004454 | 0.04454 |
| KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALS | 0.01777 | 0.09158 |
| KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_HEPARAN_SULFATE | 0.02737 | 0.09158 |
| KEGG_LYSOSOME | 0.04011 | 0.09158 |
| KEGG_APOPTOSIS | 0.04143 | 0.09158 |
| KEGG_LONG_TERM_POTENTIATION | 0.04143 | 0.09158 |
| KEGG_DRUG_METABOLISM_CYTOCHROME_P | 0.04472 | 0.09158 |
| KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P | 0.04746 | 0.09158 |
| KEGG_TYROSINE_METABOLISM | 0.05311 | 0.09158 |
| pbonf | No. in leadingedge | No. in geneset |
|---|---|---|
| 0.06058 | 3 | 26 |
| 0.08908 | 6 | 136 |
| 0.3554 | 3 | 49 |
| 0.5475 | 2 | 23 |
| 0.8022 | 4 | 114 |
| 0.8286 | 3 | 68 |
| 0.8286 | 3 | 68 |
| 0.8944 | 2 | 30 |
| 0.9491 | 2 | 31 |
| 1 | 2 | 33 |
As an alternate viewpoint, an over-representation analysis was performed on the leading edge genes from chr14 relative to all genes on chromosome 14. No significant gene sets were observed.
kegga(de = gsea_chr %>% dplyr::filter(pathway == "14") %>%
.$leadingEdge %>% .[[1]],
geneid = rownames(x$counts),
universe = x$genes %>%
dplyr::filter(grepl(x = location, pattern = "^14:")) %>%
.$ensembl_gene_id,
gene.pathway = kgByID %>%
unlist %>%
as.data.frame() %>%
rownames_to_column("pathway") %>%
set_colnames(c("pathway", "gene_id")) %>%
mutate(pathway = pathway %>% str_remove_all("[0-9]+$")) %>%
dplyr::select(gene_id, pathway)
) %>%
topKEGG() %>%
as.data.frame() %>%
rownames_to_column("pathway") %>%
mutate(FDR = p.adjust(P.DE, "fdr"),
pbonf = p.adjust(P.DE, "bonferroni")) %>%
dplyr::select(pathway, raw.p = P.DE, FDR, pbonf, `No. in leadingedge` = DE, `No. in geneset` = N) %>%
head(10) %>%
pander(caption = "Results of over-representation of genes in the leading edge of chromosome 14 against all genes on chr 14")
| pathway | raw.p | FDR | pbonf |
|---|---|---|---|
| KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALS | 0.02956 | 0.1599 | 0.5912 |
| KEGG_LONG_TERM_POTENTIATION | 0.02956 | 0.1599 | 0.5912 |
| KEGG_ALZHEIMERS_DISEASE | 0.03428 | 0.1599 | 0.6856 |
| KEGG_FOCAL_ADHESION | 0.07797 | 0.1599 | 1 |
| KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION | 0.0886 | 0.1599 | 1 |
| KEGG_APOPTOSIS | 0.09101 | 0.1599 | 1 |
| KEGG_RNA_POLYMERASE | 0.09101 | 0.1599 | 1 |
| KEGG_DNA_REPLICATION | 0.09593 | 0.1599 | 1 |
| KEGG_MTOR_SIGNALING_PATHWAY | 0.09593 | 0.1599 | 1 |
| KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY | 0.09593 | 0.1599 | 1 |
| No. in leadingedge | No. in geneset |
|---|---|
| 3 | 3 |
| 3 | 3 |
| 4 | 5 |
| 4 | 6 |
| 6 | 11 |
| 3 | 4 |
| 3 | 4 |
| 2 | 2 |
| 2 | 2 |
| 2 | 2 |
It has come to our attention that some of the samples (i.e. pools of larvae) may have arisen from fish heterozygous for a nacre (w2) allele of the gene mifta, as well as the fmr1 gene.
In each pool (sample), the larvae which showed a nacre phenotype (i.e. no pigment cells) were removed from the analysis. However, these would be larvae which are homozygous for the nacre allele. Heterozygous nacre larvae could also be present in this sample.
To investigate to what extent there are contaminiating nacre larvae in our RNA-seq samples, I will look at the number of reads which align to the nacre site on chromosome 6.
# commenting out this section so I dont need to mount phoenix each time
# the bam files are on th mounted drive on phoenix.
# bams <- list.files("/fast/users/a1211024/fmr1/03_alignedData/bams", pattern = ".bam")
# turn this off as we aligned to ensembl genome
# options(ucscChromosomeNames=FALSE)
#
# #make the reference sequence with the chromosome name as 6, not chr6
# temp = DNAStringSet(c("6"=paste(as.vector(Drerio$chr6), length(Drerio$chr6), TRUE),collapse=""))
#
# atrack_s2 <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873955_GSM4578112_fmr1_1_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment", max.height = 10,
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"),
# size = 10)
#
# png("gviz/sample_s2.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s2), chromosome = "6",
# from = 43429185, to = 43429186, main = "S2", sizes = 1)
# dev.off()
#
#
# atrack_s4 <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873956_GSM4578113_fmr1_2_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment", max_height = 10,
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"))
#
# png("gviz/sample_s4.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s4), chromosome = "6",
# from = 43429185, to = 43429186, main = "S4")
# dev.off()
#
# atrack_s5 <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873957_GSM4578114_fmr1_3_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment",
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"),
# size = 10)
# png("gviz/sample_s5.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s5), chromosome = "6",
# from = 43429185, to = 43429186, main = "S5")
# dev.off()
#
# atrack_s8 <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873958_GSM4578115_fmr1_4_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment",
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"),
# size = 10)
#
#
# png("gviz/sample_s8.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s8), chromosome = "6",
# from = 43429185, to = 43429186, main = "S8")
# dev.off()
#
# atrack_A <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873959_GSM4578116_WT_1_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment",
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"),
# size = 10)
#
# png("gviz/sample_A.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_A), chromosome = "6",
# from = 43429185, to = 43429186, main = "A")
# dev.off()
#
# atrack_D <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873960_GSM4578117_WT_2_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment",
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"),
# size = 10)
# png("gviz/sample_D.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_D), chromosome = "6",
# from = 43429185, to = 43429186, main = "D")
# dev.off()
#
#
# atrack_G <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873961_GSM4578118_WT_3_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment",
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"))
#
# png("gviz/sample_G.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_G), chromosome = "6",
# from = 43429185, to = 43429186, main = "G")
# dev.off()
#
# atrack_L <- AlignmentsTrack(
# "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873962_GSM4578119_WT_4_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam",
# isPaired = TRUE, chromosome = "6", genome = "danRer11",
# name = "Alignment",
# referenceSequence = SequenceTrack(temp,
# name = "Ref", chromosome = "6"))
#
# png("gviz/sample_L.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_L), chromosome = "6",
# from = 43429185, to = 43429186, main = "L")
# dev.off()
These next few plots are exploring the correlations between the degree of contamination of the nacre allele with the first four principal components (which together, explain ~85% of the total variance in this dataset).
# table showing number of reads aligning to the W2 site according to samtools view
nacre <- tibble(sample = colnames(x),
num_reads = c(39, 51, 55, 44, 47, 58, 65, 42),
num_nacre_reads = c(1, 4, 16, 0, 0, 16, 0, 3)) %>%
mutate(prop_nacre = num_nacre_reads*100/num_reads)
nacre %>%
left_join(x$samples) %>%
ggplot(aes(sample, prop_nacre)) +
geom_col(aes(fill = group)) +
scale_fill_manual(values = c("grey50", "red")) +
facet_wrap(~group, scales = "free_x") +
labs(x = "Sample", y = "Proportion of nacre reads", fill = "Genotype") +
ggtitle("Summary of proportion of reads aligning to nacre site")
correlation of the first four principal components with the proportion of reads aligning to the nacre allele.
#ggsave("gviz/nacreProp.png", width = 10, height = 5, units = "cm", dpi = 600, scale = 1.2)
# PCA on cqn adjusted vals.
logCPM %>%
t() %>%
prcomp %>%
.$x %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
left_join(nacre %>% dplyr::select(sample, prop_nacre)) %>%
left_join(x$samples %>% dplyr::select(sample, c( fmr1_genotype = group), lib.size)) %>%
dplyr::select(-c(PC5, PC6, PC7, PC8)) %>%
mutate_at(vars(fmr1_genotype), as.integer) %>%
column_to_rownames("sample") %>%
cor() %>%
corrplot::corrplot(
diag = FALSE,
title = "correlation between the first four PCs\nof the cqn adjusted logCPMs,\nfmr1 genotype, library size,\nand proportion of reads aligning to nacre allele",
type = "upper",
addCoef.col = "black"
)
correlation of principal components with fmr1 genotype and proportion of nacre reads. The proportion of nacre reads has almost as much correlation as fmr1 genotype with the firsr four PCs.
logCPM %>%
t() %>%
prcomp %>%
.$x %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
left_join(nacre %>% dplyr::select(sample, prop_nacre)) %>%
left_join(x$samples %>% dplyr::select(sample, c( fmr1_genotype = group), lib.size)) %>%
pivot_longer(cols = starts_with("PC"), values_to = "PC_value", names_to = "Component") %>%
dplyr::filter(Component %in% c("PC1", "PC2", "PC3", "PC4")) %>%
ggplot(aes(x = PC_value, y = prop_nacre)) +
geom_point(aes(colour = fmr1_genotype)) +
theme_bw() +
facet_wrap(~Component, scales = "free_x") +
geom_label_repel(aes(label = sample)) +
geom_smooth(method = "lm") +
scale_color_manual(values = c("grey50", "red")) +
theme(legend.position = "bottom") +
labs(x = "Principal component value",
y = "Proportion of nacre reads",
colour = "fmr1 genotype") +
ggtitle("PC values vs proportion nacre")
correlation of principal components with fmr1 genotype and proportion of nacre reads. The proportion of nacre reads has almost as much correlation as fmr1 genotype with the firsr four PCs.
#ggsave("gviz/PCvsNacre.png", width = 7.5, height = 7.5, units = "cm", scale = 1.5, dpi = 600)
Some correlation is observed with the value of the first four principal components with the proportion of reads aligning to the nacre allele. However, it does not appear to be highly significant, due to the standard errors of the regression lines being quite large.
The proportion of nacre reads is a random effect. To incoporate the effect of the degree of nacre contaminantion, I will convert the proportion of nacre contamination to a catagorial covariate (as percentages/proportions do not work well in linear models see https://stats.stackexchange.com/questions/103731/what-are-the-issues-with-using-percentage-outcome-in-linear-regression ),
Proportion of nacre alleles will be classified into three bins: - no nacre (0) - low nacre (1 > x < 10) or - high nacre (x > 10).
# add nacre to x$samples
x$samples %<>%
left_join(nacre %>%
dplyr::select(sample, prop_nacre)) %>%
mutate(nacre_factor = case_when(
prop_nacre == 0 ~ "none",
prop_nacre < 10 ~ "high",
TRUE ~ "low") %>%
as.factor())
x$samples$nacre_factor <- relevel(x$samples$nacre_factor, ref = 'none')
To deal with the random effect, I will use a blocking variable in limma: duplicateCorrelations, http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/dupcor.html
The example I am mostly following is found here. https://www.biostars.org/p/54565/
# perform the voom transformation
voom <- voom(x, design, plot=FALSE)
# estimate the correlation between replicates on the blocking variable (nacre proporttion.
# this takes a while, as it fits a mixed model for each gene
corfit <- duplicateCorrelation(voom, design, block=x$samples$nacre_factor)
# fit the linear model
fit <- lmFit(voom, design,
block=x$samples$nacre_factor,
correlation=corfit$consensus.correlation) %>%
eBayes()
# call the toptable
topTable_nacreBlock <- topTable(fit, coef="FMR1", n=Inf, adjust.method = "fdr", sort.by = "p") %>%
as_tibble() %>%
mutate(DE = adj.P.Val < 0.05)
topTable_nacreBlock %>%
head(10) %>%
pander(caption = "top 20 most DE genes after using the degree of nacre contamination as a blocking variable")
| ensembl_gene_id | location |
|---|---|
| ENSDARG00000037433 | 14:20156477-20191281:+ |
| ENSDARG00000041257 | 14:17108003-17121713:- |
| ENSDARG00000052609 | 14:9056071-9066756:+ |
| ENSDARG00000037097 | 14:30367602-30390145:- |
| ENSDARG00000104007 | 10:21650828-21820057:+ |
| ENSDARG00000102435 | 7:45975537-45976956:+ |
| ENSDARG00000036155 | 14:38932441-38946808:- |
| ENSDARG00000101393 | 6:18364048-18366339:- |
| ENSDARG00000104491 | 22:1205787-1245226:- |
| ENSDARG00000061585 | 14:29975268-29990757:+ |
| description | external_gene_name | logFC |
|---|---|---|
| fragile X mental retardation 1 [Source:ZFIN;Acc:ZDB-GENE-020731-6] | fmr1 | -1.195 |
| smoothelin-like 1 [Source:ZFIN;Acc:ZDB-GENE-050306-23] | smtnl1 | 2.009 |
| uncharacterized LOC100005318 [Source:NCBI gene;Acc:100005318] | CU468164.1 | 5.608 |
| solute carrier family 7 (cationic amino acid transporter, y+ system), member 2 [Source:ZFIN;Acc:ZDB-GENE-041212-5] | slc7a2 | 1.097 |
| protocadherin 1 gamma 1 [Source:ZFIN;Acc:ZDB-GENE-050608-1] | pcdh1g1 | 1.821 |
| pleckstrin homology domain containing, family F (with FYVE domain) member 1 [Source:ZFIN;Acc:ZDB-GENE-040426-1289] | plekhf1 | 1.381 |
| galactosidase, alpha [Source:ZFIN;Acc:ZDB-GENE-041010-207] | gla | 1.699 |
| si:dkey-31g6.6 [Source:ZFIN;Acc:ZDB-GENE-110209-1] | si:dkey-31g6.6 | -0.9758 |
| adenosine monophosphate deaminase 2a [Source:ZFIN;Acc:ZDB-GENE-091204-4] | ampd2a | 0.7346 |
| cytochrome P450, family 4, subfamily V, polypeptide 7 [Source:ZFIN;Acc:ZDB-GENE-061103-88] | cyp4v7 | 1.476 |
| AveExpr | t | P.Value | adj.P.Val | B | DE |
|---|---|---|---|---|---|
| 5.029 | -12.08 | 3.088e-07 | 0.005646 | 4.508 | TRUE |
| 2.962 | 9.529 | 2.706e-06 | 0.02474 | 1.716 | TRUE |
| -1.801 | 8.892 | 5.024e-06 | 0.03061 | -2.048 | TRUE |
| 3.746 | 6.134 | 0.0001165 | 0.358 | 0.6948 | FALSE |
| 0.8814 | 6.07 | 0.0001268 | 0.358 | -1.244 | FALSE |
| 3.081 | 6.056 | 0.0001291 | 0.358 | 0.2514 | FALSE |
| 1.813 | 6.01 | 0.0001371 | 0.358 | -0.6305 | FALSE |
| 5.529 | -5.652 | 0.0002218 | 0.4224 | 0.6961 | FALSE |
| 2.666 | 5.574 | 0.000247 | 0.4224 | -0.2985 | FALSE |
| 1.103 | 5.501 | 0.0002731 | 0.4224 | -1.327 | FALSE |
I also want to see the limma analysis without the blocking variable to determine whether the results change.
# fit the linear model
fit_noBlock <- lmFit(voom, design) %>%
eBayes()
# call the toptable
topTable_nonacreBlock <- topTable(fit_noBlock, coef="FMR1", n=Inf, adjust.method = "fdr", sort.by = "p") %>%
as_tibble() %>%
mutate(DE = adj.P.Val < 0.05)
topTable_nonacreBlock %>%
head(10) %>%
pander(caption = "top 20 most DE genes after using the degree of nacre contamination as a blocking variable")
| ensembl_gene_id | location |
|---|---|
| ENSDARG00000037433 | 14:20156477-20191281:+ |
| ENSDARG00000041257 | 14:17108003-17121713:- |
| ENSDARG00000052609 | 14:9056071-9066756:+ |
| ENSDARG00000037097 | 14:30367602-30390145:- |
| ENSDARG00000102435 | 7:45975537-45976956:+ |
| ENSDARG00000104007 | 10:21650828-21820057:+ |
| ENSDARG00000036155 | 14:38932441-38946808:- |
| ENSDARG00000104491 | 22:1205787-1245226:- |
| ENSDARG00000101393 | 6:18364048-18366339:- |
| ENSDARG00000104353 | 6:9664206-9695294:- |
| description | external_gene_name | logFC |
|---|---|---|
| fragile X mental retardation 1 [Source:ZFIN;Acc:ZDB-GENE-020731-6] | fmr1 | -1.193 |
| smoothelin-like 1 [Source:ZFIN;Acc:ZDB-GENE-050306-23] | smtnl1 | 2.027 |
| uncharacterized LOC100005318 [Source:NCBI gene;Acc:100005318] | CU468164.1 | 5.588 |
| solute carrier family 7 (cationic amino acid transporter, y+ system), member 2 [Source:ZFIN;Acc:ZDB-GENE-041212-5] | slc7a2 | 1.092 |
| pleckstrin homology domain containing, family F (with FYVE domain) member 1 [Source:ZFIN;Acc:ZDB-GENE-040426-1289] | plekhf1 | 1.389 |
| protocadherin 1 gamma 1 [Source:ZFIN;Acc:ZDB-GENE-050608-1] | pcdh1g1 | 1.8 |
| galactosidase, alpha [Source:ZFIN;Acc:ZDB-GENE-041010-207] | gla | 1.7 |
| adenosine monophosphate deaminase 2a [Source:ZFIN;Acc:ZDB-GENE-091204-4] | ampd2a | 0.7355 |
| si:dkey-31g6.6 [Source:ZFIN;Acc:ZDB-GENE-110209-1] | si:dkey-31g6.6 | -0.9592 |
| NOP58 ribonucleoprotein homolog (yeast) [Source:ZFIN;Acc:ZDB-GENE-040426-2140] | nop58 | -0.5463 |
| AveExpr | t | P.Value | adj.P.Val | B | DE |
|---|---|---|---|---|---|
| 5.029 | -11.7 | 4.143e-07 | 0.007573 | 3.307 | TRUE |
| 2.962 | 9.358 | 3.175e-06 | 0.02902 | 0.6249 | TRUE |
| -1.801 | 8.578 | 6.877e-06 | 0.0419 | -2.77 | TRUE |
| 3.746 | 5.938 | 0.0001506 | 0.4632 | -0.001555 | FALSE |
| 3.081 | 5.926 | 0.000153 | 0.4632 | -0.4563 | FALSE |
| 0.8814 | 5.827 | 0.0001746 | 0.4632 | -2.023 | FALSE |
| 1.813 | 5.815 | 0.0001774 | 0.4632 | -1.417 | FALSE |
| 2.666 | 5.421 | 0.0003049 | 0.4879 | -0.981 | FALSE |
| 5.529 | -5.398 | 0.0003148 | 0.4879 | 0.1236 | FALSE |
| 6.684 | -5.327 | 0.0003482 | 0.4879 | 0.07983 | FALSE |
Since the results do not overly change when including the cacre blocking variable, this provides evidence that the nacre contamination does not overly change the differental expression tests.
A comparison of the ranking statistic (sign(logFC) x -log10(p-value)) with and without blocking did not overly change the most significant genes.
topTable_nonacreBlock %>%
mutate(rankstat_noBlock = sign(logFC)*-log10(P.Value)) %>%
dplyr::select(external_gene_name, rankstat_noBlock) %>%
left_join(topTable_nonacreBlock %>%
mutate(rankstat_Block = sign(logFC)*-log10(P.Value)) %>%
dplyr::select(external_gene_name, rankstat_Block)) %>%
ggplot(aes(rankstat_noBlock, rankstat_Block )) +
geom_point(alpha = 0.5) +
labs(x = "-log10(p-value) x sign(logFC) no blocking",
y = "-log10(p-value) x sign(logFC) with blocking")
All my other DE analysis was performed with edgeR, so I wil try performing the blocking with that as well (following section 3.4.2 in the edgeRUsersGuide()).
design_nacre <- model.matrix(~nacre_factor + group, data = x$samples)
glmCQN_nacre <- glmFit(x, design_nacre)
topTableCQN_nacre <- glmCQN_nacre %>%
glmLRT(coef = "groupFMR1") %>%
topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>%
.$table %>%
as_tibble() %>%
mutate(DE = FDR < 0.05)
glmCQN_nacre %>%
glmLRT(coef = "groupFMR1") %>%
topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>%
.$table %>%
as_tibble() %>%
mutate(DE = FDR < 0.05)
## # A tibble: 18,280 x 10
## ensembl_gene_id location description external_gene_n… logFC logCPM LR
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSDARG0000011… 7:7653-… NULL FO704772.1 -2.25 4.45 29.0
## 2 ENSDARG0000004… 14:1710… smoothelin… smtnl1 2.59 3.32 28.8
## 3 ENSDARG0000003… 14:2015… fragile X … fmr1 -1.06 5.15 17.0
## 4 ENSDARG0000011… 6:28203… zinc finge… CZQB01141835.1 -1.02 4.81 14.2
## 5 ENSDARG0000011… 7:7658-… NULL FO704772.2 -1.84 4.07 13.2
## 6 ENSDARG0000009… 20:1412… liver-enri… leg1.1 1.96 6.62 12.1
## 7 ENSDARG0000009… 20:1398… liver-enri… leg1.2 1.73 2.70 11.1
## 8 ENSDARG0000008… 22:1919… F-box prot… fbxo42 -1.35 3.20 10.9
## 9 ENSDARG0000002… 16:2644… erythrocyt… epb41b 2.00 4.02 10.4
## 10 ENSDARG0000007… 14:4559… zgc:194285… zgc:194285 2.49 2.11 10.3
## # … with 18,270 more rows, and 3 more variables: PValue <dbl>, FDR <dbl>,
## # DE <lgl>
When including the level of nacre contamination in the model, only 2 genes were found to be DE due to fmr1 genotype. However, this method does not model the level of nacre contamination as a random effect, so this may not be completely correct.
In summary, the contamination of nacre allele probably doesnt overall effect the transcriptome too much.
This part was taken from Stephen and Melanie’s FMR1_Analysis.Rmd file:
For each of the 3 DE genes, a brief analysis was performed using the Ensembl BLAST server to search all cDNAs within the zebrafish transcriptome, with parameters set for distant homology. For each gene the longest transcript was chosen as the query sequence. Only results with an E-value < 1 were retained.
Note that the 3 DE genes were found from the analysis using limma. These genes were also detected using edgeR.
summarise() ungrouping output (override with .groups argument)
| Gene hit | description | Total Matches | Longest Match |
|---|---|---|---|
| fxr1 | fragile X mental retardation, autosomal homolog 1 | 4 | 1359 |
| fxr2 | fragile X mental retardation, autosomal homolog 2 | 8 | 977 |
| anks1b | ankyrin repeat and sterile alpha motif domain containing 1B | 1 | 88 |
| pimr138 | Pim proto-oncogene, serine/threonine kinase, related 138 | 1 | 87 |
| si:dkey-90l23.1 | si:dkey-90l23.1 | 1 | 84 |
| eno3 | enolase 3, (beta, muscle) | 1 | 80 |
| nr2f6a | nuclear receptor subfamily 2, group F, member 6a | 1 | 75 |
| ZBTB26 | zinc finger and BTB domain containing 26 | 1 | 72 |
| nat10 | N-acetyltransferase 10 | 1 | 46 |
| rybpa | RING1 and YY1 binding protein a | 1 | 43 |
| ikzf5 | IKAROS family zinc finger 5 | 4 | 38 |
| ctdnep1b | CTD nuclear envelope phosphatase 1b | 1 | 38 |
| gnb1b | guanine nucleotide binding protein (G protein), beta polypeptide 1b | 1 | 35 |
| acvr2aa | activin A receptor type 2Aa | 2 | 34 |
| ube2d2 | ubiquitin-conjugating enzyme E2D 2 (UBC4/5 homolog, yeast) | 2 | 31 |
| traf3ip2l | TRAF3 interacting protein 2-like | 1 | 31 |
Below is the volcano plot from edgeR with the blast hits for fmr1 highlighted.
topTableCQN %>%
mutate(`Blast Hit` = external_gene_name %in% fmr1Blast$`Gene hit`) %>%
arrange(`Blast Hit`) %>%
ggplot(aes(logFC, -log10(PValue), label = external_gene_name, colour = `Blast Hit`)) +
geom_point(alpha = 0.4) +
geom_label_repel(data = . %>% dplyr::filter(`Blast Hit`),
show.legend = FALSE,
size = 3) +
scale_colour_manual(values = c("grey30", "blue")) +
theme_bw() +
theme(aspect.ratio = 1)
#ggsave("blastRe_edgeR.png")
To more formally test for transcriptional adaptation, I will perform an enrichment analysis, treating the blast hit genes as a gene set. Using fry with a directional hypothesis, this did not reveal any significant upregulation as a group.
blasthits <-
read_csv(here::here("results/BLAST_fmr1s-txome.csv")) %>%
left_join(as.data.frame(genes), by = c(`Gene hit` = "symbol")) %>%
dplyr::select( "Gene hit" , "gene_id" ) %>%
na.omit() %>%
.$gene_id %>%
unique
logCPM %>%
fry(
index = blasthits,
design = design,
contrast = "FMR1",
sort = "directional"
) %>%
rownames_to_column("geneset") %>%
as_tibble() %>%
dplyr::select(geneset, NGenes, PValue) %>%
pander(
caption = paste(
"Enrichment analysis using fry on the blast hits. No significant upregulation is observed.",
"Tests were directional."
),
split.tables = Inf,
justify = "lrr"
)
| geneset | NGenes | PValue |
|---|---|---|
| set1 | 48 | 0.6288 |
logCPM %>%
as.data.frame() %>%
.[blasthits,] %>%
na.omit %>%
rownames_to_column("gene_id") %>%
left_join(genes %>% as_tibble) %>%
dplyr::select(colnames(x), gene_name) %>%
dplyr::distinct(gene_name, .keep_all = TRUE) %>%
column_to_rownames("gene_name") %>%
t() %>%
pheatmap(color = viridis_pal(option = "viridis")(100), cellwidth = 10, cellheight = 10,
main = "CQN-adjusted expression of blast hits")
From FMR1_Analysis.Rmd:
Although not the subject of investigation, the genes detected as DE were analysed using BLAST as a reverse viewpoint on the experiment. None of the genes contained matches to fmr1.
summarise() ungrouping output (override with .groups argument)
| Gene hit | description | Total Matches | Longest Match |
|---|---|---|---|
| smtna | smoothelin a | 1 | 297 |
| smtnb | smoothelin b | 4 | 258 |
| si:ch211-195o20.7 | si:ch211-195o20.7 | 1 | 173 |
| CU929178.1 | NUL | 1 | 134 |
| nefla | neurofilament, light polypeptide a | 3 | 132 |
| psd2 | pleckstrin and Sec7 domain containing 2 | 1 | 92 |
| si:dkey-20i20.10 | si:dkey-20i20.10 | 1 | 88 |
| taf3 | TAF3 RNA polymerase II, TATA box binding protein (TBP)-associated facto | 2 | 84 |
| tgfbr2a | transforming growth factor beta receptor 2a | 1 | 78 |
| ppp1r37 | protein phosphatase 1, regulatory subunit 37 | 1 | 76 |
| brd3b | bromodomain containing 3b | 3 | 63 |
| cmya5 | cardiomyopathy associated 5 | 2 | 61 |
| chd5 | chromodomain helicase DNA binding protein 5 | 1 | 60 |
| chd2 | chromodomain helicase DNA binding protein 2 | 2 | 58 |
| zgc:175284 | zgc:175284 | 2 | 54 |
| zgc:173726 | zgc:173726 | 6 | 52 |
| map6a | microtubule-associated protein 6a | 2 | 52 |
| si:dkey-20i20.9 | si:dkey-20i20.9 | 1 | 52 |
| znf1170 | zinc finger protein 1170 | 1 | 52 |
| stab2 | stabilin 2 | 4 | 51 |
| znf1166 | zinc finger protein 1166 | 3 | 51 |
| igfn1.1 | immunoglobulin-like and fibronectin type III domain containing 1, tandem duplicate 1 | 1 | 49 |
| zswim8 | zinc finger, SWIM-type containing 8 | 4 | 47 |
| kcnma1a | potassium large conductance calcium-activated channel, subfamily M, alpha member 1a | 3 | 44 |
| gpbp1l1 | GC-rich promoter binding protein 1-like 1 | 2 | 44 |
| foxn2b | forkhead box N2b | 1 | 43 |
| polr2f | polymerase (RNA) II (DNA directed) polypeptide F | 1 | 42 |
| lmod1a | leiomodin 1a (smooth muscle) | 1 | 41 |
| nktr | natural killer cell triggering receptor | 1 | 41 |
| im:7152348 | im:7152348 | 1 | 39 |
| jak2b | Janus kinase 2b | 1 | 35 |
| nop58 | NOP58 ribonucleoprotein homolog (yeast) | 1 | 35 |
| zfr2 | zinc finger RNA binding protein 2 | 2 | 29 |
| chd3 | chromodomain helicase DNA binding protein 3 | 3 | 23 |
Volcano plot showing closest genes from the BLAST search on smtnl1
One gene (nop58) with a BLAST match to smtnl narrowly missed the threshold for being called DE. However, the match was only 35bp.
summarise() ungrouping output (override with .groups argument)
| Gene hit | description | Total Matches | Longest Match |
|---|---|---|---|
| si:ch211-262h13.3 | si:ch211-262h13.3 | 5 | 268 |
| si:ch211-244k5.1 | si:ch211-244k5.1 | 1 | 267 |
| si:dkey-242h9.3 | si:dkey-242h9.3 | 1 | 70 |
| or117-1 | odorant receptor, family F, subfamily 117, member 1 | 1 | 53 |
| cib1 | calcium and integrin binding 1 (calmyrin) | 2 | 39 |
| im:7147486 | im:7147486 | 2 | 35 |
| ppp1r18 | protein phosphatase 1, regulatory subunit 18 | 1 | 35 |
Volcano plot showing closest genes from the BLAST search on smtnl1
From FMR1_Analysis.Rmd: At this stage, no clear evidence can be found for genetic compensation in this dataset. Alternative strategies were briefly investigated using pair-wise alignments between all DE genes, but no compelling results were found. As no specific evidence exists for 3’UTR, 5’UTR or other non-coding regions in compensation, no analysis was conducted on these elements.
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] grid stats4 splines parallel stats graphics grDevices
## [8] utils datasets methods base
##
## other attached packages:
## [1] ensembldb_2.12.1 AnnotationFilter_1.12.0
## [3] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3
## [5] Biobase_2.48.0 BSgenome.Drerio.UCSC.danRer11_1.4.2
## [7] BSgenome_1.56.0 rtracklayer_1.48.0
## [9] Biostrings_2.56.0 XVector_0.28.0
## [11] Gviz_1.32.0 GenomicRanges_1.40.0
## [13] GenomeInfoDb_1.24.2 IRanges_2.22.2
## [15] S4Vectors_0.26.1 biomaRt_2.44.4
## [17] pheatmap_1.0.12 viridis_0.5.1
## [19] viridisLite_0.3.0 ggpubr_0.4.0
## [21] pander_0.6.3 pathview_1.28.1
## [23] cowplot_1.1.0 ggfortify_0.4.11
## [25] cqn_1.34.0 quantreg_5.75
## [27] SparseM_1.78 preprocessCore_1.50.0
## [29] nor1mix_1.3-0 mclust_5.4.7
## [31] ggeasy_0.1.2 ggrepel_0.8.2
## [33] DT_0.16 scales_1.1.1
## [35] goseq_1.40.0 geneLenDataBase_1.24.0
## [37] BiasedUrn_1.07 msigdbr_7.2.1
## [39] fgsea_1.14.0 AnnotationHub_2.20.2
## [41] BiocFileCache_1.12.1 dbplyr_2.0.0
## [43] BiocGenerics_0.34.0 edgeR_3.30.3
## [45] limma_3.44.3 forcats_0.5.0
## [47] stringr_1.4.0 dplyr_1.0.2
## [49] purrr_0.3.4 readr_1.4.0
## [51] tidyr_1.1.2 tibble_3.0.4
## [53] ggplot2_3.3.2 tidyverse_1.3.0
## [55] magrittr_2.0.1
##
## loaded via a namespace (and not attached):
## [1] utf8_1.1.4 tidyselect_1.1.0
## [3] RSQLite_2.2.1 htmlwidgets_1.5.2
## [5] BiocParallel_1.22.0 munsell_0.5.0
## [7] statmod_1.4.35 withr_2.3.0
## [9] colorspace_2.0-0 highr_0.8
## [11] knitr_1.30 rstudioapi_0.13
## [13] ggsignif_0.6.0 labeling_0.4.2
## [15] KEGGgraph_1.48.0 GenomeInfoDbData_1.2.3
## [17] farver_2.0.3 bit64_4.0.5
## [19] rprojroot_2.0.2 vctrs_0.3.5
## [21] generics_0.1.0 xfun_0.19
## [23] biovizBase_1.36.0 R6_2.5.0
## [25] locfit_1.5-9.4 bitops_1.0-6
## [27] DelayedArray_0.14.1 assertthat_0.2.1
## [29] promises_1.1.1 nnet_7.3-14
## [31] gtable_0.3.0 conquer_1.0.2
## [33] rlang_0.4.9 MatrixModels_0.4-1
## [35] lazyeval_0.2.2 rstatix_0.6.0
## [37] dichromat_2.0-0 checkmate_2.0.0
## [39] broom_0.7.2 BiocManager_1.30.10
## [41] yaml_2.2.1 abind_1.4-5
## [43] modelr_0.1.8 crosstalk_1.1.0.1
## [45] backports_1.2.0 httpuv_1.5.4
## [47] Hmisc_4.4-1 tools_4.0.2
## [49] ellipsis_0.3.1 RColorBrewer_1.1-2
## [51] Rcpp_1.0.5 base64enc_0.1-3
## [53] progress_1.2.2 zlibbioc_1.34.0
## [55] RCurl_1.98-1.2 prettyunits_1.1.1
## [57] rpart_4.1-15 openssl_1.4.3
## [59] cluster_2.1.0 SummarizedExperiment_1.18.2
## [61] haven_2.3.1 here_1.0.0
## [63] fs_1.5.0 data.table_1.13.2
## [65] openxlsx_4.2.3 reprex_0.3.0
## [67] ProtGenerics_1.20.0 matrixStats_0.57.0
## [69] hms_0.5.3 mime_0.9
## [71] evaluate_0.14 xtable_1.8-4
## [73] XML_3.99-0.5 rio_0.5.16
## [75] jpeg_0.1-8.1 readxl_1.3.1
## [77] gridExtra_2.3 compiler_4.0.2
## [79] crayon_1.3.4 htmltools_0.5.0
## [81] mgcv_1.8-33 later_1.1.0.1
## [83] Formula_1.2-4 lubridate_1.7.9.2
## [85] DBI_1.1.0 corrplot_0.84
## [87] rappdirs_0.3.1 Matrix_1.2-18
## [89] car_3.0-10 cli_2.2.0
## [91] pkgconfig_2.0.3 GenomicAlignments_1.24.0
## [93] foreign_0.8-80 xml2_1.3.2
## [95] rvest_0.3.6 VariantAnnotation_1.34.0
## [97] digest_0.6.27 graph_1.66.0
## [99] rmarkdown_2.5 cellranger_1.1.0
## [101] fastmatch_1.1-0 htmlTable_2.1.0
## [103] curl_4.3 shiny_1.5.0
## [105] Rsamtools_2.4.0 lifecycle_0.2.0
## [107] nlme_3.1-150 jsonlite_1.7.1
## [109] carData_3.0-4 askpass_1.1
## [111] fansi_0.4.1 pillar_1.4.7
## [113] lattice_0.20-41 KEGGREST_1.28.0
## [115] fastmap_1.0.1 httr_1.4.2
## [117] survival_3.2-7 GO.db_3.11.4
## [119] interactiveDisplayBase_1.26.3 glue_1.4.2
## [121] zip_2.1.1 png_0.1-7
## [123] BiocVersion_3.11.1 bit_4.0.4
## [125] Rgraphviz_2.32.0 stringi_1.5.3
## [127] blob_1.2.1 org.Hs.eg.db_3.11.4
## [129] latticeExtra_0.6-29 memoise_1.1.0